Abstract

Dynamic dictionary-based compression schemes have been the most widely
used data compression schemes since they appeared in the foundational
work of Ziv and Lempel in 1977, commonly referred to as LZ77. Their
work is the basis of Deflate, gZip, WinZip, 7Zip and many other
compression programs. All of those compression schemes use variants of
the greedy approach to parse the text into dictionary phrases. The
optimality of greedy parsing was proved by Cohn et al. (1996) for
fixed-length codes and unbounded dictionaries. It was never proved for
bounded-size dictionaries, which all of those schemes actually require.

We define the suffix-closed property for dynamic dictionaries and
we show that any LZ77-based dictionary, including the bounded variants,
satisfies this property. Under this condition we prove the optimality
of the greedy parsing as a variant of the proof by Cohn et al.
Introduction



The foundational Ziv and Lempel LZ77 algorithm [15] is the basis of almost
all the well-known dictionary compressors, like gZip, PkZip, WinZip and 7Zip.
They consider a portion of the previous text as a dictionary, i.e. they use
a dynamic dictionary formed by the set of all the factors of the text up
to the current position within a sliding window of fixed size. A dictionary
phrase refers to an occurrence of that phrase in the text by means of the
pair (length, offset), where the offset is the backward offset w.r.t. the
current position. Since a phrase usually occurs more than once in the text
and since pointers with smaller offsets usually have a smaller
representation, the occurrence closest to the current position is preferred.

Furthermore, in LZ77-based compression, the greedy approach is used
to parse the text into phrases, i.e., at each step, the longest match
between the dictionary and the upcoming text is chosen. This is commonly
called the greedy phrase. Some LZ77-based algorithms, such as the Deflate
algorithm and the compressors based on it, like gZip and PkZip, use variants
of the greedy approach to parse the text. The Deflate64 algorithm,
implemented in WinZip and 7Zip, contains some heuristics that parse the text
differently in order to improve the compression ratio, but its time
complexity was never clearly stated.
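
The greedy approach described above can be illustrated with a minimal
sketch. It assumes that the dictionary at position i is the set of factors
of the last `window` symbols already processed, that a match may not extend
past the current position, and that a plain symbol is emitted when no match
exists (as in LZSS). The names `greedy_lz77_parse` and `decode` are
hypothetical and are not taken from any of the cited compressors.

```python
def greedy_lz77_parse(text, window):
    """Greedy parse of `text` into (length, offset) pairs or literal symbols.

    Assumption: the dictionary at position i is the set of factors of
    text[max(0, i - window):i], so a match must end at or before i.
    """
    phrases = []
    i = 0
    n = len(text)
    while i < n:
        start = max(0, i - window)
        best_len, best_off = 0, 0
        # Scan candidate occurrences from the closest to the furthest, so
        # that among equally long matches the smallest offset is kept.
        for p in range(i - 1, start - 1, -1):
            length = 0
            while (p + length < i and i + length < n
                   and text[p + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - p
        if best_len == 0:
            phrases.append(text[i])          # no match: emit a literal
            i += 1
        else:
            phrases.append((best_len, best_off))
            i += best_len
    return phrases


def decode(phrases):
    """Rebuild the text from the phrases produced by greedy_lz77_parse."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, tuple):
            length, off = phrase
            begin = len(out) - off
            out.extend(out[begin:begin + length])
        else:
            out.append(phrase)
    return "".join(out)
```

For instance, with a window of 8 symbols the text "abababab" is parsed into
two literals followed by the pointers (2, 2) and (4, 4).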

Research on dictionary-based data compression and parsing optimality
has produced several notable results in recent decades. Let us recall
some of them in a brief historical overview.

In '73, Wagner's paper (see [13]) showed an $O(n|D|^2)$ dynamic programming
solution for the parsing problem in the case of a static dictionary, where
$n$ is the text length, $D$ is the dictionary and $|D|$ is the dictionary
cardinality, i.e. the number of phrases belonging to the dictionary.
Dictionary phrases can overlap each other.

In '74 Schuegraf et al. (see [11]) showed that the parsing problem is
equivalent to the shortest path problem on a graph associated with both a
text and a static dictionary. Since the full graph for a text of length $n$
can have $O(n^2)$ edges in the worst case and the shortest path algorithm has
$O(V + E)$ complexity, we have another solution for the parsing problem of
$O(n^2)$ complexity.
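
The reduction to a shortest path problem can be sketched as follows. This
is only an illustration under the uniform cost assumption, where every
edge has cost one and a breadth-first search suffices; it is not the
algorithm of the cited paper. Node i stands for text position i and an edge
(i, j) exists when text[i:j] is a dictionary phrase. The function name
`optimal_static_parse` is hypothetical.

```python
from collections import deque


def optimal_static_parse(text, dictionary):
    """Return a parsing of `text` with the fewest phrases, or None.

    Assumption: uniform phrase cost, so a breadth-first search from node 0
    to node len(text) finds a shortest path in the parsing graph.
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=0)
    prev = {0: None}              # prev[j] = start of the phrase ending at j
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n:
            break
        for j in range(i + 1, min(n, i + max_len) + 1):
            if j not in prev and text[i:j] in dictionary:
                prev[j] = i
                queue.append(j)
    if n not in prev:
        return None               # the text cannot be parsed with this dictionary
    phrases = []
    j = n
    while j != 0:
        i = prev[j]
        phrases.append(text[i:j])
        j = i
    return phrases[::-1]
```

With the dictionary {a, ba, aba, bba}, the text "ababba" is covered by the
two phrases "aba" and "bba".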

In '76 Ziv and Lempel (see [9]) introduced a new measure of complexity
for a given text, defined as the number of phrases produced by parsing the
text with a dynamic prefix-closed dictionary. This preliminary work soon led
to the foundational dynamic dictionary-based compression methods presented
in [15, 16], a.k.a. LZ77 and LZ78, which appeared in '77 and '78
respectively. They both use an online greedy parsing that is simple and fast
in practice. Both compression methods use a uniform (constant) cost model
for the dictionary pointers, i.e. they use bounded-size dictionaries and
fixed-length codes for dictionary phrase references. The greedy approach
used to parse the text is realized by choosing the longest match between the
dictionary phrases and the upcoming text, scanning the text left to right,
until the whole text is covered. After any dictionary phrase in the parsing,
which can also be the empty word, a single plain text symbol is used. This
guarantees the existence of a parsing for any text and any dictionary.

In '82, the LZSS compression algorithm, based on LZ77, was
presented (see [12]). It improves the compression ratio and the execution
time without changing the original parsing approach. The main difference is
that a symbol is used only when there is no match between dictionary and
text. It uses a flag bit to distinguish symbols from dictionary pointers in
the parsing. In the same paper Storer et al. proved the optimality of the
greedy parsing for the original LZ77 scheme with unbounded dictionary (see
Theorem 10 in [12] with p = 1).

In '84, the LZW variant of LZ78 was introduced by Welch (see [14]). This is
one of the first theoretical compression methods that uses a dynamic
dictionary and variable pointer costs. The main difference w.r.t. LZ78 is
that the text is supposed to be composed of symbols from a fixed alphabet,
known in advance. The dictionary is initialized with all the alphabet
symbols. This guarantees that there is always at least one dictionary phrase
matching a factor of the text starting at any position. Exploiting this
property, the parsing is composed only of dictionary phrases, without
explicit symbols, leading to better compression. The LZW scheme has been
well received by the research community, and plenty of LZW variants have
been presented so far.

In '85, Hartman and Rodeh proved in [6] the optimality of the
one-step-lookahead parsing for prefix-closed static dictionaries and uniform
pointer cost. The main point of this approach is to choose the phrase that
is the first phrase of the longest match between two dictionary phrases and
the text. In other words, if the current parsing covers the text up to the
$i$-th character, then it chooses the phrase $w$ such that $ww'$ is the
longest match with the text starting at position $i$, with $w$, $w'$
belonging to the dictionary.

In '89 and later in '92, the Deflate algorithm was presented and used
in the PkZip and gZip compressors. It uses an LZ77-like dictionary, the LZSS
flag bit and variants of the greedy parsing. Both dictionary pointers and
symbols are encoded by using a Huffman code. Those compression schemes
soon became so popular that they were included in many communication
protocols, commercial compression software and transmission devices.

In '95, Horspool investigated in [7] the effect of non-greedy parsing
in LZ-based compression. He highlighted that using the above
one-step-lookahead parsing in the case of dynamic dictionaries leads to
better compression than the one obtained by using the greedy parsing.
Horspool showed some experimental results using the LZW algorithm and a new
LZW variant that he presented in the same paper.

In '96 the greedy parsing was finally proved by Cohn et al. (see
[1]) to be optimal for static suffix-closed dictionaries under the uniform
cost model. They also proved that the right-to-left greedy parsing is
optimal for prefix-closed dictionaries. Notice that the greedy parsing can
be computed in linear time. Since the LZ77 dictionary is "assumed" to be
suffix-closed, this result is more general than the earlier one by Storer
et al. We present more details about the LZ77 dictionary and the
suffix-closed property in the next section.

In '99, Matias and Sahinalp (see [10]) gave a linear-time optimal parsing
algorithm in the case of prefix-closed dynamic dictionaries and uniform cost
of dictionary pointers, i.e. the codewords of all the pointers have equal
length. They extended the results given in [5], [7] and [8] to the dynamic
case. Matias and Sahinalp called their parsing algorithm Flexible Parsing.
It is also known as semi-greedy parsing.

In '09, Ferragina et al. (see [5]) introduced an optimal parsing algorithm
for LZ77-like dictionaries and variable-length codes, where the code length
is assumed to be the cost of a dictionary pointer. In that paper, parsing
optimality refers to compression optimality, i.e. the parsing that leads
to the best compression.

In '10, Crochemore et al. (see [2] and the extended version [3]) introduced
an optimal parsing for prefix-closed dictionaries and variable pointer costs.
It was called dictionary-symbolwise flexible parsing and it fits both the
LZ77 and the LZ78 dictionary cases. It uses a graph-based model for the
parsing problem where each node represents a position in the text and edges
represent dictionary phrases. Edges are weighted according to the bit length
of the encoded length and offset pair. It works for the original LZ77 and
LZ78 algorithms and for almost all of their known variants. Recently, a new
data structure called Multilayer Suffix Tree was presented (see [4]) to
address a weak version of the rightmost position problem, which is strictly
related to the parsing optimality problem.

The main goal of this paper is to better explain the relationship between
the LZ77 dictionary variants and the suffix-closed property and to prove the
optimality of the greedy parsing for all of those cases. This paper is
organized as follows. In Section 1 we formally define the suffix-closed
property for dynamic dictionaries and we show that any LZ77-based dictionary,
including the bounded variants, satisfies this property. In Section 2 we
prove the optimality of the greedy parsing for suffix-closed dictionaries as
a variant of the proof by Cohn et al.

1 Suffix-Closed Dynamic Dictionaries 

In the data compression field, a dictionary is a set of finite-length
sequences, or phrases. It is shared between compressor and decompressor. A
static dictionary is a fixed set of phrases that does not change during the
compression-decompression process. It is known in advance w.r.t. the input
text. The weakness of this model is that the dictionary does not depend on
the text and, therefore, cannot adapt to it. This leads to poor compression
results for texts that have little overlap with the dictionary in use.

A dynamic dictionary is a set of phrases that can change during the
compression-decompression process. It can be the empty set at the very
beginning of the compression process or it can be already initialized.
Subsequently, it is populated according to a dictionary algorithm. Usually,
phrase deletions are also supported in order to limit the dictionary size.
Given a text $T$ of length $n$, for any point in time $0 \le i \le n$, we
call $D_i$ the dictionary at time $i$ of the compression or decompression
process, i.e. $D_i$ is the dictionary after the first $i$ symbols of the
text have been processed.

A static dictionary $D$ is prefix-closed (suffix-closed) if and only if for
any phrase $w \in D$, all the prefixes (suffixes) of $w$ belong to the
dictionary, i.e. $\mathit{pref}(w) \subseteq D$ ($\mathit{suff}(w) \subseteq D$).
For instance, the dictionary $D = \{a, ba, aba, bba\}$ is suffix-closed.
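
The two closure properties of a static dictionary can be checked
mechanically. The following sketch assumes that the dictionary is given as
a set of non-empty strings and that only non-empty proper prefixes and
suffixes are required to belong to it; the function names are hypothetical.

```python
def is_suffix_closed(dictionary):
    """True if every non-empty proper suffix of every phrase is a phrase."""
    return all(w[k:] in dictionary
               for w in dictionary
               for k in range(1, len(w)))


def is_prefix_closed(dictionary):
    """True if every non-empty proper prefix of every phrase is a phrase."""
    return all(w[:k] in dictionary
               for w in dictionary
               for k in range(1, len(w)))
```

On the example above, `is_suffix_closed({"a", "ba", "aba", "bba"})` holds,
while the same dictionary is not prefix-closed because the prefix "b" of
"ba" is missing.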

The LZ77 dictionary is defined as the set of factors of a portion of the
already processed text. In other words, for any text $T$ and at any time
$i$, the dictionary is the set of factors of the text fitting in a sliding
window of length $h$, i.e. $\mathit{fact}(T[i-h+1..i])$. At any time $i$, the
dictionary $D_i$ is both prefix- and suffix-closed. The LZ78 dictionary is
maintained, starting from the empty set, by inserting a phrase formed by the
greedy phrase followed by one symbol. For instance, if at the moment $i$ the
greedy phrase matching the text is $T[i..j]$, $i \le j$, then the next
dictionaries $D_k$, $i < k \le j+1$, are set to
$D_i \cup \{T[i..j]\,T[j+1]\}$. This construction algorithm maintains the
prefix-closed property for every dictionary $D_i$.

The classic result of Cohn and Khazan of '96 (see [1]) states that if $D$ is
a static suffix-closed dictionary, then the greedy parsing is optimal under
the uniform cost assumption. Symmetrically, the reverse of the greedy parsing
on the reversed text is optimal for static prefix-closed dictionaries.
Roughly speaking, the original proof concerns suffix-closed dictionaries and
shows that choosing the longest dictionary phrases guarantees covering the
text with the minimum number of phrases. Unfortunately, since the LZ77 and
LZ78 dictionaries are not static, the above results do not apply to them.

Let us focus on the suffix-closed definition for dynamic dictionaries. Let
us recall that a static dictionary is a set of words $D$. A static dictionary
is suffix-closed if and only if for any phrase $w$ in the dictionary $D$ the
set of suffixes $\mathit{suff}(w)$ of $w$ is a subset of the dictionary, i.e.
$\mathit{suff}(w) \subseteq D$. Turning to the dynamic setting, let us say
that at any moment $i \ge 0$, a dynamic dictionary $D_i$ is a set of words.
The suffix-closed and the prefix-closed properties have commonly been
considered as naturally extended, without a formal definition, to the dynamic
case with the additional condition "at any time". Therefore, what is commonly
meant by a suffix-closed dynamic dictionary is just that, at any time $i$,
the dictionary $D_i$ has the suffix-closed property.

Notice that this definition does not make any assumption on the relation- 
ship between dictionaries at two different moments and it does not suffice to 
extend the parsing optimality for static dictionary to the dynamic case. 

We define the suffix-closed property for dynamic dictionaries as follows. 

Definition 1.1. A dynamic dictionary $D$ has the suffix-closed property iff,
at any moment $i$, for any dictionary phrase $w \in D_i$ and for any
$0 \le k < |w|$, the suffix $w_k = w[k..|w|-1]$ of $w$ of length $|w| - k$
is in $D_i$ and in $D_{i+k}$.

Notice that the above suffix-closed property implies the natural one.
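
The difference between Definition 1.1 and the natural "at any time"
property can be made concrete with a small checker. This is a sketch under
stated assumptions: a dynamic dictionary is represented as a list `dicts`
where `dicts[i]` plays the role of $D_i$, the condition on $D_{i+k}$ is
tested only when $i + k$ is a valid moment, and the LZ77 dictionary at time
$i$ is taken to be the set of non-empty factors of the last `h` processed
symbols. All function names are hypothetical.

```python
def factors(s):
    """Set of all non-empty factors (substrings) of s."""
    return {s[a:b] for a in range(len(s)) for b in range(a + 1, len(s) + 1)}


def lz77_dictionaries(text, h):
    """dicts[i] = factors of the last h symbols among the first i symbols."""
    return [factors(text[max(0, i - h):i]) for i in range(len(text) + 1)]


def satisfies_definition_1_1(dicts):
    """Check Definition 1.1 on a sequence of dictionaries D_0, D_1, ..."""
    last = len(dicts) - 1
    for i, d_i in enumerate(dicts):
        for w in d_i:
            for k in range(len(w)):
                if i + k > last:
                    break                     # D_{i+k} is not defined
                suffix = w[k:]
                if suffix not in d_i or suffix not in dicts[i + k]:
                    return False
    return True
```

For example, the sequence `[{"ab", "b"}, set(), set()]` is suffix-closed at
any single time, yet it fails the check because "b" is no longer available
at time 1. The sliding-window dictionaries built by `lz77_dictionaries`
pass it.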

We say that a dictionary is non-decreasing when $D_i \subseteq D_j$ for any
points in time $i, j$ with $i < j$. A static dictionary is obviously
non-decreasing. Practically speaking, a dynamic dictionary is non-decreasing
when it can only grow over time. For instance, the original LZ78 dictionary
is a non-decreasing dictionary because, at each algorithm step, one phrase is
inserted into the dictionary. On the contrary, many practical implementations
and variants of the LZ78 dictionary are not non-decreasing. To save space,
the size of the dictionary is bounded in practice and some phrases are
deleted from the previous dictionary.

Since the LZ77 dictionary is defined as the set of factors of a sliding
window, i.e. the backward text up to a certain distance, the bounded LZ77
dictionary does not have the non-decreasing property.

Proposition 1.1. The original LZ77 bounded dictionary is suffix-closed.
The unbounded variant of the LZ77 dictionary is non-decreasing and
suffix-closed.

Proof. Recall that, for any text $T$, the LZ77 dictionary is equal to
$\mathit{fact}(T[i-h+1..i])$. The LZ77 dictionary is unbounded when
$h \ge |T|$. In this case, for any $i \le |T|$, the dictionary $D_i$ is equal
to the set $\mathit{fact}(T[0 : i])$, which is obviously non-decreasing.

Let us focus on the general set $\mathit{fact}(T[i-h+1 : i])$. This is a
sliding window of size $h$ over the text $T$. By the very definition of the
set of factors, the dictionary $D_i$ is suffix-closed at any moment $i$. For
any value $i$, let $T[i-h+1 : i] = au$ and $T[i-h+2 : i+1] = ub$ with $a, b$
in $\Sigma$ and $u$ in $\Sigma^*$. Since all the proper suffixes of $au$ are
also suffixes of $u$, for any $w \in \mathit{fact}(T[i-h+1 : i]) = D_i$ the
proper suffixes $w_k$ of length $|w| - k$, $1 \le k < |w|$, are also in
$\mathit{fact}(T[i-h+2 : i+1]) = D_{i+1}$. Therefore, any proper suffix of
$w \in D_i$ is also in $D_{i+1}$, for any $i$, $w$. It is easy to see that
this property is equivalent to the suffix-closed property defined in
Definition 1.1. □

Let us now look at the effect of the prefix- and suffix-closed properties
on the graph-based model of the parsing problem in order to visualize those
concepts. Given a text $T$ and a dictionary $D$, if $D$ has the strong
suffix-closed property, then for any edge $(i, j)$ of the graph $G_{D,T}$
associated with a phrase $w \in D_i$, with $|w| = j - i$ and $w = T[i : j]$,
all the edges $(k, j)$, $i < k < j$, are in $G_{D,T}$. In the case of
prefix-closed dictionaries, since prefix edges start from the same node, the
prefixes of a dictionary phrase are all represented in the graph even if the
dictionary has just the natural prefix-closed property.
Figure 1: Detail of the differences between parsing O and parsing Q over a
text $T$ between positions $|o_1 \cdots o_n|$ and $|o_1 \cdots o_h|$. Nodes
and dots represent the text and edges represent parsing phrases as reported
on edge labels.

2 Greedy Parsing Optimality 

We now want to extend the elegant proof of Cohn et al. (see [1]) to the case
of suffix-closed dynamic dictionaries.

We are given a text $T$ of length $n$ and a dynamic dictionary $D$ where, at
the $i$-th moment with $0 \le i \le n$, the text $T_i$ has been processed and
$D_i$ is the dictionary at time $i$. Recall that we are under the uniform
cost assumption.

Theorem 2.1. The greedy parsing of $T$ is optimal for strong suffix-closed
dynamic dictionaries.

Proof. The proof is by induction. We want to prove that for any $n$ smaller
than or equal to the number of phrases of an optimal parsing, there exists an
optimal parsing where the first $n$ phrases are greedy phrases (within this
proof, $n$ denotes the induction index rather than the text length). The
inductive hypothesis is that there exists an optimal parsing where the first
$n-1$ phrases are greedy phrases. We will prove that there is an optimal
parsing where the first $n$ phrases are greedy and, therefore, any greedy
parsing is optimal. We use here the notation $w^k$ to refer to the suffix of
$w$ of length $|w| - k$.

Fix a text $T$ and a strong suffix-closed dynamic dictionary $D$, let
$O = o_1 o_2 \cdots o_p = T$ be an optimal parsing and let
$Q = g_1 g_2 \cdots g_q = T$ be the greedy parsing, where, obviously,
$p \le q$.

The base of the induction with $n = 0$ is obviously true. Let us prove the
inductive step.

By the inductive hypothesis, for all $i < n$ we have that $o_i = g_i$. Since
$g_n$ is greedy, the $n$-th phrase of the greedy parsing is longer than or
equal to the $n$-th phrase of the optimal parsing, i.e. $|g_n| \ge |o_n|$ and
therefore $|o_1 \cdots o_n| \le |g_1 \cdots g_n|$.

If $|g_n| = |o_n|$, then the thesis follows. Otherwise, $|g_n| > |o_n|$ and
$o_n$ is the first phrase in the optimal parsing that is not equal to the
$n$-th greedy phrase.
Figure 2: Detail of the differences between parsing O and parsing Q over a
text $T$ between positions $|o_1 \cdots o_n|$ and $|o_1 \cdots o_h|$. Nodes
and dots represent the text and edges represent parsing phrases as reported
on edge labels. The dashed edge $o_h^k$ represents a suffix of $o_h$.

Let $h$ be the minimum number of optimal parsing phrases needed to reach or
pass the end of $g_n$ over the text, i.e.
$h = \min\{i \mid |o_1 \cdots o_i| \ge |g_1 \cdots g_n|\}$. Since
$|g_n| > |o_n|$, we have $h > n$. If $|o_1 \cdots o_h| = |g_1 \cdots g_n|$,
then the parsing $g_1 \cdots g_n o_{h+1} \cdots o_p$ uses a number of phrases
strictly smaller than the number of phrases used by the optimal parsing,
which is a contradiction. Therefore $|o_1 \cdots o_h| > |g_1 \cdots g_n|$.
The reader can see this case reported in Figure 1.

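Let $o_1 \cdots o_{h-1} = T_j$ be the text up to the $j$-th symbol, i.e.
$j = |o_1 \cdots o_{h-1}|$. Then $o_h \in D_j$, where $D_j$ is the dynamic
dictionary at time $j$. Let $o_h^k$ be the $k$-th suffix of $o_h$ with
$k = |g_1 \cdots g_n| - |o_1 \cdots o_{h-1}|$, i.e. the suffix of $o_h$ of
length $|o_1 \cdots o_h| - |g_1 \cdots g_n|$. By the property of $D$ stated
in Definition 1.1, $o_h^k \in D_{j+k}$ and then there exists a parsing
$o_1 \cdots o_{n-1} g_n o_h^k o_{h+1} \cdots o_p$, where
$g_n o_h^k = o_n \cdots o_h$.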

From the optimality of $O$, it follows that $h = n + 1$, otherwise there
exists a parsing with fewer phrases than an optimal one. See Figure 2.
Therefore $o_1 \cdots o_{n-1} g_n o_{n+1}^k o_{n+2} \cdots o_p$ is also an
optimal parsing. Since $o_1 \cdots o_{n-1}$ is equal to
$g_1 \cdots g_{n-1}$, the thesis follows. □

Corollary 2.2. The greedy parsing is an optimal parsing for any version of 
the LZ77 dictionary. 

The proof of the above corollary follows directly from Theorem 2.1 and
Proposition 1.1.
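
The statement of the corollary can be checked experimentally on small
inputs. The sketch below is an illustration, not part of the proof. It
assumes that the dictionary at position i contains every single symbol (so
that a parsing always exists) together with the non-empty factors of the
last `h` processed symbols, and it compares the number of greedy phrases
with the minimum computed by dynamic programming under the uniform cost
model. All function names are hypothetical.

```python
from itertools import product


def in_dictionary(text, i, j, h):
    """True if text[i:j] is a single symbol or a factor of the window."""
    phrase = text[i:j]
    return len(phrase) == 1 or phrase in text[max(0, i - h):i]


def greedy_count(text, h):
    """Number of phrases of the greedy (longest match) parsing."""
    n = len(text)
    i = count = 0
    while i < n:
        j = i + 1
        # The set of factors is prefix-closed, so the longest match is
        # found by extending the phrase one symbol at a time.
        while j < n and in_dictionary(text, i, j + 1, h):
            j += 1
        i = j
        count += 1
    return count


def optimal_count(text, h):
    """Minimum number of phrases over all parsings (dynamic programming)."""
    n = len(text)
    best = [0] + [n + 1] * n      # best[j] = fewest phrases covering text[:j]
    for i in range(n):
        for j in range(i + 1, n + 1):
            if in_dictionary(text, i, j, h):
                best[j] = min(best[j], best[i] + 1)
    return best[n]


def greedy_is_optimal_on_all(max_len, alphabet, windows):
    """Exhaustively compare greedy and optimal counts on short texts."""
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            text = "".join(symbols)
            for h in windows:
                if greedy_count(text, h) != optimal_count(text, h):
                    return False
    return True
```

A call such as `greedy_is_optimal_on_all(8, "ab", (1, 2, 3, 8))` enumerates
all binary texts of length at most 8 for a few window sizes and reports
whether the two counts always coincide.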

To the best of our knowledge, this is the first proof of optimality of the
greedy parsing that covers the original LZ77 dictionary case and almost all
of the practical LZ77 dictionary implementations where the search buffer is
a sliding window on the text.